Text Classification

Feature

feature: a measurable variable that is distinctive of something we want to model

If we want to classify something, features are the things we measure to make the classification.

  • In a neural network, features may not be nameable, but they are still there.
  • We need varied, heterogeneous features to make a good model.
  • Values for several features of an observation can be put into a single vector.
  • Features should be useful in discriminating between categories.
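
As a minimal sketch of putting several feature values for one observation into a single vector: the function `extract_features` and the three features it measures are hypothetical choices made for illustration, not features prescribed by these notes.

```python
def extract_features(text):
    """Map one observation (a text) to a fixed-length feature vector.

    The three features are illustrative assumptions:
    token count, word-type count, and mean token length.
    """
    tokens = text.lower().split()
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    mean_len = sum(len(t) for t in tokens) / n_tokens if n_tokens else 0.0
    return [n_tokens, n_types, mean_len]


vector = extract_features("the cat sat on the mat")
print(vector)
```

Each position in the vector always holds the same feature, so vectors from different observations can be compared by a classifier.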

Example: Features for diagnosing Alzheimer’s disease

Alzheimer’s disease involves atrophy of the brain. Some of its observable effects on speech:

  • excessive pauses (acoustic disfluencies)
  • excessive word type repetitions
  • simplistic or short sentences; function words might be dropped

To help diagnose Alzheimer’s disease, we can turn these characteristics into measurable features and use them to classify the patient. One might measure:

  • proportion of utterance spent in silence
  • entropy of the distribution of word types
  • number of word tokens in the sentence
  • number of prepositions and determiners in the sentence
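
The four measurements above could be computed roughly as follows. This is a sketch under stated assumptions: the `(duration, is_silence)` segment format is hypothetical, and the preposition and determiner lists are deliberately tiny and incomplete.

```python
import math
from collections import Counter

# Assumed, incomplete word lists for illustration only.
PREPOSITIONS = {"in", "on", "at", "of", "to", "with", "for", "from", "by"}
DETERMINERS = {"the", "a", "an", "this", "that", "these", "those"}


def silence_proportion(segments):
    """segments: list of (duration_seconds, is_silence) pairs (hypothetical format)."""
    total = sum(duration for duration, _ in segments)
    silent = sum(duration for duration, is_silence in segments if is_silence)
    return silent / total if total else 0.0


def word_type_entropy(tokens):
    """Shannon entropy (in bits) of the empirical word-type distribution."""
    n = len(tokens)
    counts = Counter(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0


def count_preps_and_dets(tokens):
    """Number of tokens found in the assumed preposition/determiner lists."""
    return sum(1 for t in tokens if t in PREPOSITIONS or t in DETERMINERS)


tokens = "the dog is in the the garden".split()
features = [
    silence_proportion([(1.0, False), (0.5, True), (2.0, False), (0.5, True)]),
    word_type_entropy(tokens),
    len(tokens),
    count_preps_and_dets(tokens),
]
print(features)
```

Heavy repetition of a few word types concentrates the distribution and lowers the entropy, which is why entropy can serve as a repetition measure.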

Example: Features for sentiment analysis

Sentiment analysis can involve detecting:

  • stress or frustration in a conversation
  • interest, confusion or preferences

Some useful features:

  • trigrams
  • first person pronouns
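
A short sketch of extracting these two features from a tokenized utterance. It assumes word-level trigrams (the notes do not say whether word or character trigrams are meant), and the pronoun list is an assumed English list.

```python
from collections import Counter

# Assumed list of English first-person pronouns.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}


def word_trigrams(tokens):
    """All runs of three consecutive tokens, as tuples."""
    return list(zip(tokens, tokens[1:], tokens[2:]))


def first_person_count(tokens):
    """Number of tokens that are first-person pronouns."""
    return sum(1 for t in tokens if t.lower() in FIRST_PERSON)


tokens = "i really do not like my new phone".split()
trigram_counts = Counter(word_trigrams(tokens))
n_first_person = first_person_count(tokens)
print(trigram_counts.most_common(3), n_first_person)
```

Trigrams such as ("do", "not", "like") keep the negation next to the word it modifies, which single-word counts would lose.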

Pre-processing

Pre-processing involves preparing your data to make feature extraction easier or more valid.

  • e.g. punctuation is often attached directly to words, and we want to separate the two.
  • But there is no perfect pre-processing; it depends on the task.
    • Noise reduction might remove some information.
    • Consistency is important, but not always possible.
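
Separating punctuation from words can be sketched with a regular expression. The helper name `separate_punctuation` is hypothetical, and the rule it applies (split off every character that is neither a word character nor whitespace) is one simple choice among many.

```python
import re


def separate_punctuation(text):
    """Split text into tokens, with each punctuation mark as its own token."""
    # Surround every non-word, non-space character with spaces, then split.
    spaced = re.sub(r"([^\w\s])", r" \1 ", text)
    return spaced.split()


print(separate_punctuation("Wait, what?!"))  # ['Wait', ',', 'what', '?', '!']
print(separate_punctuation("don't"))         # ['don', "'", 't']
```

The second call shows the trade-off: the same rule that cleans up "what?!" also breaks the contraction "don't" apart, which may or may not suit the task.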

Part of speech (PoS)

part-of-speech: lexical category or morphological class.

  • Linguists like to group words according to their distribution in grammatical sentences.

content categories

function categories

Tagging

tagging: the process of assigning a part-of-speech to each word in a sequence.

Tagging as classification: we have a sequence of observations (the words) and must decide on the best assignment of hidden variables (the tags). For example, "back" can be a noun, a verb, ...
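
One common way to choose the best assignment of hidden tags is a hidden Markov model decoded with the Viterbi algorithm. The notes do not name a method, so treat this as one possible sketch: the tag set and all probabilities below are invented for illustration, and the function assumes a non-empty sentence.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for words under a toy HMM."""
    # best[i][t] = (probability of the best path ending in tag t at word i,
    #               previous tag on that path)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Trace back from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


# Invented toy parameters (assumptions, not estimated from data).
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.0, "NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 1.0},
    "NOUN": {"back": 0.3, "people": 0.4, "plan": 0.3},
    "VERB": {"back": 1.0},
}

print(viterbi(["the", "back"], TAGS, START, TRANS, EMIT))
print(viterbi(["people", "back", "the", "plan"], TAGS, START, TRANS, EMIT))
```

With these toy numbers, "back" comes out as a noun after "the" and as a verb after "people": the tag depends on the surrounding sequence, not on the word alone.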